Towards Neural Machine Translation with Partially Aligned Corpora

Authors

  • Yining Wang
  • Yang Zhao
  • Jiajun Zhang
  • Chengqing Zong
  • Zhengshan Xue
Abstract

While neural machine translation (NMT) has become the new paradigm, its parameter optimization requires large-scale parallel data, which is scarce in many domains and language pairs. In this paper, we address a new translation scenario in which only monolingual corpora and phrase pairs exist. We propose a new method for translation with partially aligned sentence pairs, which are derived from the phrase pairs and monolingual corpora. To make full use of the partially aligned corpora, we adapt the conventional NMT training method in two respects. On the one hand, different generation strategies are designed for aligned and unaligned target words. On the other hand, a different objective function is designed to model the partially aligned parts. The experiments demonstrate that our method achieves relatively good results in this translation scenario, and that a small amount of bitext can substantially boost translation quality.
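To make the notion of a partially aligned sentence pair concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not the authors' procedure: the abstract does not say how sentences are matched, so the functions `find_spans`, `partial_alignment`, and `aligned_target_mask`, the exact-match rule, and the toy data are all hypothetical. The sketch marks which target words are covered by a phrase pair that occurs on both sides, which is the kind of aligned/unaligned distinction the abstract refers to.

```python
from typing import List, Tuple

Span = Tuple[int, int]  # half-open token span [start, end)


def find_spans(tokens: List[str], phrase: List[str]) -> List[Span]:
    """Return every span where `phrase` occurs contiguously in `tokens`."""
    n = len(phrase)
    if n == 0:
        return []
    return [(i, i + n) for i in range(len(tokens) - n + 1)
            if tokens[i:i + n] == phrase]


def partial_alignment(src: str, tgt: str,
                      phrase_pairs: List[Tuple[str, str]]) -> List[Tuple[Span, Span]]:
    """Link source and target spans for phrase pairs found in both sentences.

    Assumption: whitespace tokenization and exact matching; only the first
    occurrence on each side is linked.
    """
    src_tokens, tgt_tokens = src.split(), tgt.split()
    links = []
    for src_phrase, tgt_phrase in phrase_pairs:
        s_spans = find_spans(src_tokens, src_phrase.split())
        t_spans = find_spans(tgt_tokens, tgt_phrase.split())
        if s_spans and t_spans:
            links.append((s_spans[0], t_spans[0]))
    return links


def aligned_target_mask(tgt: str, links: List[Tuple[Span, Span]]) -> List[bool]:
    """True for target tokens covered by a linked phrase, False otherwise."""
    mask = [False] * len(tgt.split())
    for _, (start, end) in links:
        for i in range(start, end):
            mask[i] = True
    return mask


# Toy, non-parallel sentences that happen to share two phrase pairs.
src_sentence = "the cat sat on the mat"
tgt_sentence = "le chat dort sur un tapis"
pairs = [("the cat", "le chat"), ("mat", "tapis"), ("dog", "chien")]

links = partial_alignment(src_sentence, tgt_sentence, pairs)
print(links)
print(aligned_target_mask(tgt_sentence, links))
```

A mask like this is one plausible way to treat aligned and unaligned target words differently during training, for example by weighting their loss terms separately. The abstract does not specify the actual generation strategies or objective function.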

Similar resources

CS671A Natural Language Processing Hindi ↔ English Parallel Corpus Generation from Comparable Corpora for Neural Machine Translation

Neural Machine Translation (NMT) is a new approach to the well-studied task of machine translation, with significant advantages over traditional approaches in terms of reduced model size and better performance. NMT models require a parallel corpus of significant size to be trained, which is lacking for the Hindi ↔ English language pair. However, significant amounts of comparable corpora a...

Building Strong Multilingual Aligned Corpora

Recent advances have allowed algorithms that learn from aligned natural language texts to exploit aligned sentences in more than two languages. We investigate ways of combining $\binom{N}{2}$ bilingual aligned corpora together to create a multilingual aligned corpus across N languages. As a result of the combination of several corpora, our algorithms output a multilingual corpus, with each aligned tup...

Building Parallel Corpora for SMT System: A Case Study of English-Manipuri

Statistical machine translation (SMT) systems are developed using sentence-aligned parallel corpora. The difficulty is that no parallel corpus of the required size exists for many language pairs. Preparing a large-scale parallel corpus takes time and demands linguistic skill. In the present work, the various issues of a quality parallel corpus and a technique that extracts p...

Using Machine Translation to Convert Between Difficulties in Rhythm Games

A method is presented for converting between Guitar Hero difficulty levels by treating the problem as one of machine translation, with the different difficulties as different "languages." The Guitar Hero I and II discs provide aligned corpora with which to train bigram-based language models and translation models. Given an Expert sequence, the model can create sequences of Hard, Medium, or Easy ...

Statistical Machine Translation with Word- and Sentence-Aligned Parallel Corpora

The parameters of statistical translation models are typically estimated from sentence-aligned parallel corpora. We show that significant improvements in the alignment and translation quality of such models can be achieved by additionally including word-aligned data during training. Incorporating word-level alignments into the parameter estimation of the IBM models reduces alignment error rate an...

Publication date: 2017